Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey
نویسندگان
چکیده
Abstract With the urgent demand for generalized deep models, many pre-trained big models are proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT), generative transformers (GPT), etc. Inspired by success of these in single domains (like computer and natural language processing), multi-modal have also drawn more attention recent years. In this work, we give a comprehensive survey hope paper could provide new insights helps fresh researchers to track most cutting-edge works. Specifically, firstly introduce background pre-training reviewing conventional learning, works process, vision, speech. Then, task definition, key challenges, advantages (MM-PTMs), discuss MM-PTMs with focus on data, objectives, network architectures, knowledge enhanced pre-training. After that, downstream tasks used validation large-scale MM-PTMs, including generative, classification, regression tasks. We visualization analysis model parameters results representative Finally, point out possible research directions topic that may benefit future addition, maintain continuously updated list models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey .
منابع مشابه
Modal Consistency based Pre-Trained Multi-Model Reuse
Multi-Model Reuse is one of the prominent problems in Learnware [Zhou, 2016] framework, while the main issue of Multi-Model Reuse lies in the final prediction acquisition from the responses of multiple pre-trained models. Different from multiclassifiers ensemble, there are only pre-trained models rather than the whole training sets provided in Multi-Model Reuse configuration. This configuration...
متن کاملEfficient Large-Scale Multi-Modal Classification
While the incipient internet was largely text-based, the modern digital world is becoming increasingly multi-modal. Here, we examine multi-modal classification where one modality is discrete, e.g. text, and the other is continuous, e.g. visual representations transferred from a convolutional neural network. In particular, we focus on scenarios where we have to be able to classify large quantiti...
متن کاملOperational planning of a large-scale multi-modal transportation system
We describe the operational planning system POP developed for Danzas Euronet, a merger of Deutsche Post Transport and Danzas NTO. As of November 1997, the system has been daily used for the transport planning of on average 4000 container-orders a day on train and vehicles in Germany. Important features include the involvement of many practical constraints as well as the requirement to balance t...
متن کاملMulti-modal Analysis of Music: A large-scale Evaluation
Multimedia data by definition comprises several different types of content modalities. Music specifically inherits e.g. audio at its core, text in the form of lyrics, images by means of album covers, or video in the form of music videos. Yet, in many Music Information Retrieval applications, only the audio content is utilised. Recent studies have shown the usefulness of incorporating other moda...
متن کاملA Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the rapid growth of multimodal data. It takes one type of data as the query to retrieve relevant data of another type. For example, a user can use a text to retrieve relevant pictures or videos. Since the query and its retrieved results can be of different modalities, how to measure the content similarity between different m...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Machine Intelligence Research
سال: 2023
ISSN: ['2731-538X', '2731-5398']
DOI: https://doi.org/10.1007/s11633-022-1410-8